IBM Books

Nways Manager for AIX-LAN Network Manager/I.H.M.P. User's Guide


Automatic Handling of Management Module Changes

When managing 8260 hubs, you may sometimes perform management tasks that cause the management station to lose its connection to the target hub. When this happens, 8250, 8260, and 8265 Device Manager uses a recovery mechanism to recover automatically from the resulting SNMP errors.

To work properly, the recovery mechanism requires you to configure your hubs and management station in a certain way. The required configurations are described in the next section, Required Configurations for Automatic Recovery.


Required Configurations for Automatic Recovery

The following prerequisites are required for automatic recovery and change handling:


Understanding the SNMP Recovery Process

In all cases, the panels involving the faulty Management module are invalidated before the recovery starts. On these panels, only the Close and Help pushbuttons are available. If a panel is open, a pop-up window displays the result of the SNMP recovery. When you click on the OK pushbutton, all invalidated panels are revalidated.

Messages describing each SNMP recovery step are logged in the NetView for AIX log, /usr/OV/log/nettl.LOG00. To display the contents of this log, enter the command:

/usr/OV/bin/netfmt -f nettl.LOG00
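
If you want to read the formatted log from a script, the command above can be wrapped as in the following minimal sketch. The helper names netfmt_command and read_recovery_log are hypothetical, and running the command from the /usr/OV/log directory is an assumption made so that the bare file name resolves. Only the -f option shown above is used.

```python
import os
import subprocess
from typing import List, Optional

NETFMT = "/usr/OV/bin/netfmt"  # path given in the text
LOG_DIR = "/usr/OV/log"        # directory of nettl.LOG00 given in the text


def netfmt_command(log_name: str = "nettl.LOG00") -> List[str]:
    """Build the same command line as shown in the text."""
    return [NETFMT, "-f", log_name]


def read_recovery_log(log_name: str = "nettl.LOG00") -> Optional[str]:
    """Return the formatted log text, or None if netfmt is not installed here."""
    if not (os.path.isfile(NETFMT) and os.path.isdir(LOG_DIR)):
        return None
    # Assumption: the command is run from the log directory.
    result = subprocess.run(
        netfmt_command(log_name),
        cwd=LOG_DIR,
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

On a station without NetView for AIX installed, read_recovery_log() simply returns None.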

If connection with the master agent is lost, the lost hub connection icon is displayed in the Hub Level view and the hub status changes to red.

The SNMP recovery process starts a hub poll. The poll is performed for all hubs, regardless of their status (managed or unmanaged) and of the polling policy configured for them under NetView for AIX.


Recoverable Situations

Recoverable situations include:

Mastership reelection

Explanation: A No Such Name SNMP error has occurred on the hub master agent. You have requested the hub configuration from an agent that is now a slave and no longer implements the MIB variables.

System Action: All agents in the hub are requested to report their master status.

Network assignment change

Explanation: A Time-Out SNMP error has occurred on a hub agent. You are either trying to communicate with an agent that no longer uses the same IP address or that has been removed from the hub, or you are using an incorrect community name.

System Action: When the error occurs on a master agent, each master agent interface in the hub is tried. If this is unsuccessful, all agents in the hub are requested to report their master status.

When the error occurs on a slave agent, a hub poll is started.

Lost connection with the master agent

Explanation: Following a mastership reelection or a network assignment change, the connection with the master agent may be lost. If there is still connectivity with another agent in the same hub, recovery from the lost connection may be initiated.

System Action: The prerequisites are checked (see Recovery of Lost Connection with Master) and the recovery is initiated according to the user customization of this function.
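
The first system actions described above can be summarized as a lookup from the detected SNMP error and the role of the agent. The following is a minimal sketch only: the names SnmpError, RECOVERY_PLAN, and recovery_steps are hypothetical, and combinations that this section does not describe return an empty list.

```python
from enum import Enum
from typing import Dict, List, Tuple


class SnmpError(Enum):
    NO_SUCH_NAME = "No Such Name"
    TIME_OUT = "Time-Out"


# Key: (error detected, True if the error occurred on the master agent).
RECOVERY_PLAN: Dict[Tuple[SnmpError, bool], List[str]] = {
    # Mastership reelection.
    (SnmpError.NO_SUCH_NAME, True): [
        "request master status from all agents in the hub",
    ],
    # Network assignment change detected on the master agent.
    (SnmpError.TIME_OUT, True): [
        "try each master agent interface in the hub",
        "if unsuccessful, request master status from all agents in the hub",
    ],
    # Network assignment change detected on a slave agent.
    (SnmpError.TIME_OUT, False): [
        "start a hub poll",
    ],
}


def recovery_steps(error: SnmpError, on_master: bool) -> List[str]:
    """Return the system actions for a situation, or [] if not described here."""
    return RECOVERY_PLAN.get((error, on_master), [])
```

For example, recovery_steps(SnmpError.TIME_OUT, False) returns the single action "start a hub poll".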


Recovery of Lost Connection with Master

After losing connection with the master agent, hub management capability can be recovered by means of the SNMP recovery mechanism, Lost-Connection-with-Master. The lost connection may be due to a network reassignment or a mastership reelection. If the loss of connection persists due to a network problem, the Lost-Connection-with-Master mechanism tries to recover the situation.

Prerequisites

The prerequisites for the Lost-Connection-with-Master SNMP recovery mechanism are as follows:

Basic Principles

The Lost-Connection-with-Master SNMP recovery takes place after a traditional SNMP recovery has failed with a lost connection with the master agent. It relies on the capability of the Management modules to set their own mastership priority and to trigger a mastership reelection for the hub, even if they are slaves.

The basic algorithm is as follows:

  1. Check whether the prerequisites are met. If not, no recovery can be started and a Warning message is displayed.

  2. Elect a slave agent candidate to become the new master.

    The candidate agent is chosen based on its current mastership priority: the agent with the highest priority is chosen. This simple algorithm lets you designate back-up master Management modules for a hub by assigning appropriate mastership priorities to its Management modules.

  3. Set the mastership priority to the highest value.

    a. If the change in mastership priority is successful, connectivity is established with the agent. Then trigger a mastership reelection through the agent.

    b. If the PDU is acknowledged by the agent, wait a few seconds to let the hardware complete its reelection process, then trigger a traditional SNMP recovery.

    c. If the recovery is successful, the chosen agent is the new master.

    d. Set the new master agent mastership priority back to its original value, even if step c) was not successful.

  4. If step 3) or step b) has failed, repeat step 1) with the next eligible slave agent.

Note: If step b) failed, it is probably due to a severe hardware problem or system error which is not recoverable.
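
The algorithm above can be sketched as follows. This is an illustration under stated assumptions, not the product's implementation: SlaveAgent, recover_lost_master, and the callables standing in for SNMP operations are hypothetical names, the highest priority value is assumed to be 10, and the sketch moves on to the next candidate whenever a candidate does not end up as master.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

MAX_PRIORITY = 10  # assumption: highest mastership priority value


@dataclass
class SlaveAgent:
    """Hypothetical stand-in for one slave Management module."""
    ip: str
    priority: int                           # current mastership priority
    set_priority: Callable[[int], bool]     # True if the SNMP set succeeded
    trigger_reelection: Callable[[], bool]  # True if the PDU was acknowledged


def recover_lost_master(
    agents: List[SlaveAgent],
    prerequisites_met: bool,
    traditional_recovery: Callable[[], bool],
    settle: Callable[[], None] = lambda: None,
) -> Optional[SlaveAgent]:
    """Return the agent that became master, or None if no recovery succeeded."""
    # Step 1: without the prerequisites, no recovery is started.
    if not prerequisites_met:
        return None
    # Step 2: candidates are tried from highest to lowest current priority.
    for agent in sorted(agents, key=lambda a: a.priority, reverse=True):
        # Step 3: raise the candidate to the highest priority.
        if not agent.set_priority(MAX_PRIORITY):
            continue  # Step 4: move on to the next eligible agent.
        recovered = False
        # Step a): trigger a mastership reelection through the candidate.
        if agent.trigger_reelection():
            settle()  # Step b): let the hardware finish the reelection.
            recovered = traditional_recovery()  # Steps b) and c).
        # Step d): restore the original priority whatever the outcome
        # (restoring after a failed step a) is an assumption of this sketch).
        agent.set_priority(agent.priority)
        if recovered:
            return agent
    return None
```

The settle callable is a placeholder for the wait of a few seconds; a caller could pass, for example, lambda: time.sleep(5), where the duration is also an assumption.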

Configuration Parameters at Application Level using SMIT


Recovery of a lost connection with a master agent is controlled by application parameters that you configure using SMIT. To configure these parameters, follow these steps:

  1. Open the Root window or the IBM Hubs Topology.

  2. From the menu bar, select Administer -> Campus Manager SMIT -> Configure -> CML Hub Manager capability configuration -> Change the SNMP recovery configuration.

In the panel that is displayed, configure the following parameters:


SNMP Recovery Pop-Up Messages

The general format of all pop-up messages is:
Pop-up Identifier
Result of the recovery
The SNMP error that was detected
Optionally, additional information

Pop-Up Identifier

The pop-up identifier consists of:

Result of the Recovery

The result of the recovery may be any one of the recovery messages described in Recovery Messages.

SNMP Error Detected

The SNMP error that was originally detected consists of four parts:

Additional Information

The additional information included at the end of the pop-up message is related to the recovery of the lost connection with the master agent (see Optional Information).


Recovery Messages

Master agent changed: old master agent IP address nnn.nnn.nnn.nnn new master: nnn.nnn.nnn.nnn hub polling started.

Explanation: A No Such Name SNMP error has been detected on the master agent, or on a slave agent during polling. The new master agent has been found.

System Action: A hub poll is started.

Connectivity reestablished after a time out. New master agent IP address used: nnn.nnn.nnn.nnn Old IP address used: nnn.nnn.nnn.nnn hub polling started.

Explanation: A Time-Out SNMP error has been detected at the master agent, but a new IP address has been found. The master agent was probably assigned to a new network, but its IP address was up to date in the NetView for AIX database.

System Action: A hub poll is started.

Connectivity reestablished after a time out. Same master agent IP address nnn.nnn.nnn.nnn used.

Explanation: A Time-Out SNMP error was detected at the master agent, but connectivity was reestablished with the same IP address. This generally occurs when trying to write with an incorrect community name.

User Response: Check your SNMP configuration on both the agent side and the network management station side.

Probable configuration change: hub polling started.

Explanation: This message occurs in two situations:

System Action: A hub poll is started.

Lost connection with master agent due to system error: next polling in n minutes.

Explanation: A No Such Name SNMP error has been detected on the master agent, or on a slave during polling. Due to a system error, the master status of the agents in the hub was not retrieved.

System Action: A hub poll will start in n minutes, where n is the polling interval multiplied by 5.
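
As a worked example of this delay, the following minimal sketch computes n from the polling interval. The function names are hypothetical, and the polling interval is assumed to be expressed in minutes.

```python
RETRY_FACTOR = 5  # from the text: the polling interval multiplied by 5


def next_poll_delay_minutes(polling_interval_minutes: int) -> int:
    """Return n, the delay in minutes before the next hub poll."""
    if polling_interval_minutes <= 0:
        raise ValueError("polling interval must be positive")
    return polling_interval_minutes * RETRY_FACTOR


def lost_connection_message(polling_interval_minutes: int) -> str:
    """Fill in n in the message text shown above."""
    n = next_poll_delay_minutes(polling_interval_minutes)
    return (
        "Lost connection with master agent due to system error: "
        f"next polling in {n} minutes."
    )
```

With a polling interval of 10 minutes, n is 50, so the next hub poll starts 50 minutes later.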

User Response: Check the log to find the system error. This error could also be due, for example, to a problem with the SNMP NetView for AIX API.

Non-recoverable loss of connection during the polling of the last agent known as master.

Explanation: A Time-Out SNMP error has been detected on the last agent known as master. This occurs if you poll a hub after a mastership reelection that was not successfully handled by 8250, 8260, and 8265 Device Manager. All agents are slaves, so 8250, 8260, and 8265 Device Manager polls the last agent known as master.

System Action: None

User Response: If you wish to force a mastership reelection in order to let a slave agent become master, request a manual hub poll and follow the instructions given for the Lost-Connection-with-Master recovery.

Attempt to retrieve a MIB variable unknown by the agent: next polling in n minutes.

Explanation: A No Such Name SNMP error has been detected on the master agent during polling, and the master is still master.

The agent's version is not fully supported by 8250, 8260, and 8265 Device Manager.

System Action: A hub poll is started in n minutes, where n is the polling interval multiplied by 5.

User Response: Use SMIT to change the default version for this agent to the lowest value supported.

Probable mismatch between the agent version and the MIB variable requested.

Explanation: A No Such Name SNMP error has been detected from a panel on a slave agent.

The agent's version is not fully supported by 8250, 8260, and 8265 Device Manager.

System Action: None.

User Response: Use SMIT to change the default version for this agent to the lowest value supported.

Mastership reelection in progress: SNMP recovery stopped. You might want to perform a manual request hub poll in a few seconds.

Explanation: A No Such Name SNMP error has been detected on the master agent (or on a slave during polling) and the agents are in the process of electing a new master.

System Action: SNMP recovery is stopped.

User Response: Wait a few seconds, then request a manual hub poll.

Probable configuration change. Polling in progress. Please try again.

Explanation: An SNMP error has been detected from a panel.

System Action: A poll is taking place.

Probable configuration change. Recovery already in progress. Please try again.

Explanation: An SNMP error has been detected from a panel.

System Action: Recovery is already in progress for that hub for another IP address.

Agent IP address nnn.nnn.nnn.nnn is no longer accurate.

Explanation: An SNMP error has been detected from a panel involving an agent that is no longer in the hub.

System Action: None.

User Response: Close the panel.

Lost connection with master agent due to a non-recoverable SNMP error.

Explanation: SNMP recovery has been started for an error that is not No Such Name or Time-Out. A system error occurred during the polling.

System Action: None.

User Response: Check the log to find the error.

Last known master agent became slave and the connection was lost with the new master. The lost connection with master agent recovery forced agent nnn.nnn.nnn.nnn master again.

System Action: None.

User Response: None.

The connection with the master agent was lost. The following agents are candidates to become master:

  Type   Priority   IP address
  xMM    x          nnn.nnn.nnn.nnn
  xMM    x          nnn.nnn.nnn.nnn

Do you want Hub Manager to attempt to force a mastership reelection?

Explanation: SNMP recovery has been started for an error that is not No Such Name or Time-Out. A system error occurred during the polling.

System Action: None

User Response: Check the log to find the error.

Lost connection with master agent or cannot find master agent in the hub: next polling in n minutes.

Explanation: A No Such Name SNMP error has been detected on the master agent (or on a slave during polling) and all agents are slaves. The lost connection with master SNMP recovery was not triggered. This arises if a new Management module has been plugged into the hub and is now master, but 8250, 8260, and 8265 Device Manager was not aware of this change.

This message could also be due to an SNMP error (such as a Time-Out) where the master status of the agents in the hub could not be retrieved.

System Action: A hub poll is started in n minutes, where n is the polling interval multiplied by 5.


Optional Information

The following optional information may appear at the end of the pop-up message.

The lost connection with master agent recovery can not be attempted because no slave agent was responding.

Explanation: The slave agents do not respond to SNMP requests.

System Action: None.

User Response: Check the connectivity between the management station and the agents and the customization of the community names.

The lost connection with master agent recovery can not be attempted because it was already attempted.

Explanation: The recovery of lost connection with master was attempted, but no slave agent could be made the new master, so recovery is stopped.

System Action: None.

User Response: Check mastership priorities for this hub. The old master probably has a priority of ten.

The lost connection with master recovery was initiated but failed due to system error.

Explanation: The recovery of lost connection with master was attempted but did not succeed.

System Action: None.

User Response: None.

The lost connection with master recovery was not initiated

Explanation: The recovery of lost connection with master was not attempted, because either the user answered NO to the pop-up or the default action was NORECOVER.

System Action: None.

User Response: None.

The lost connection with master recovery was initiated but failed.

Explanation: The recovery of lost connection with master was attempted but failed.

System Action: None.

User Response: None.

